
    Automatic Discrimination of Human and Neural Machine Translation: A Study with Multiple Pre-Trained Models and Longer Context

    We address the task of automatically distinguishing between human-translated (HT) and machine-translated (MT) texts. Following recent work, we fine-tune pre-trained language models (LMs) to perform this task. Our work differs in that we use state-of-the-art pre-trained LMs, as well as the test sets of the WMT news shared tasks as training data, to ensure the sentences were not seen during training of the MT system itself. Moreover, we analyse performance for a number of different experimental setups, such as adding translationese data, going beyond the sentence level and normalizing punctuation. We show that (i) choosing a state-of-the-art LM can make quite a difference: our best baseline system (DeBERTa) outperforms both BERT and RoBERTa by over 3% accuracy, (ii) adding translationese data is only beneficial if there is not much data available, (iii) considerable improvements can be obtained by classifying at the document level, and (iv) normalizing punctuation and thus avoiding (some) shortcuts has no impact on model performance.
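    A minimal sketch of how such an HT/MT classifier could be fine-tuned, assuming the HuggingFace Transformers and Datasets libraries; the model checkpoint, label convention, hyperparameters, and toy data are illustrative assumptions, not the paper's actual setup.

    ```python
    # Hypothetical sketch: fine-tune DeBERTa to label sentences as HT (0) or MT (1).
    # Checkpoint, hyperparameters, and data format are assumptions, not the paper's.
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=2)

    # Toy examples; in the paper the training data comes from WMT news test sets.
    data = Dataset.from_dict({
        "text": ["A sentence translated by a human.", "A sentence produced by an MT system."],
        "label": [0, 1],
    })

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ht-mt-classifier",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data,
    )
    trainer.train()
    ```

    Document-level classification, as explored in the paper, would amount to feeding longer concatenated segments to the same setup, subject to the model's maximum input length.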

    Writer adaptation for offline text recognition: An exploration of neural network-based methods

    Handwriting recognition has seen significant success with the use of deep learning. However, a persistent shortcoming of neural networks is that they are not well equipped to deal with shifting data distributions. In the field of handwritten text recognition (HTR), this shows itself in poor recognition accuracy for writers who are not similar to those seen during training. An ideal HTR model should be adaptive to new writing styles in order to handle the vast number of possible writing styles. In this paper, we explore how HTR models can be made writer adaptive by using only a handful of examples from a new writer (e.g., 16 examples) for adaptation. Two HTR architectures are used as base models, using a ResNet backbone along with either an LSTM or Transformer sequence decoder. Using these base models, two methods are considered to make them writer adaptive: 1) model-agnostic meta-learning (MAML), an algorithm commonly used for tasks such as few-shot classification, and 2) writer codes, an idea originating from automatic speech recognition. Results show that an HTR-specific version of MAML known as MetaHTR improves performance compared to the baseline with a 1.4 to 2.0 improvement in word error rate (WER). The improvement due to writer adaptation is between 0.2 and 0.7 WER, where a deeper model seems to lend itself better to adaptation using MetaHTR than a shallower model. However, applying MetaHTR to larger HTR models or sentence-level HTR may become prohibitive due to its high computational and memory requirements. Lastly, writer codes based on learned features or Hinge statistical features did not lead to improved recognition performance. Comment: 21 pages including appendices, 6 figures, 10 tables
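    A rough sketch of the MAML inner/outer loop idea behind this kind of writer adaptation, not the MetaHTR implementation itself; the model, loss function, and per-writer task batches are hypothetical placeholders, written against the PyTorch 2.x functional API.

    ```python
    # Illustrative MAML-style meta-update for writer-adaptive HTR.
    # `model`, `loss_fn`, and the task batches are placeholders, not MetaHTR code.
    import torch

    def maml_step(model, tasks, loss_fn, meta_optimizer, inner_lr=0.01):
        """One meta-update over a batch of writer-specific tasks.

        Each task is a (support, query) pair: a handful of labelled lines from
        one writer used for adaptation, and held-out lines from the same writer
        used for the meta-loss.
        """
        meta_optimizer.zero_grad()
        meta_loss = 0.0
        for support, query in tasks:
            # Inner loop: take one gradient step on the support set.
            params = dict(model.named_parameters())
            support_out = torch.func.functional_call(model, params, support["x"])
            support_loss = loss_fn(support_out, support["y"])
            grads = torch.autograd.grad(support_loss, list(params.values()),
                                        create_graph=True)
            adapted = {name: p - inner_lr * g
                       for (name, p), g in zip(params.items(), grads)}
            # Outer loop: evaluate the adapted parameters on the query set.
            query_out = torch.func.functional_call(model, adapted, query["x"])
            meta_loss = meta_loss + loss_fn(query_out, query["y"])
        (meta_loss / len(tasks)).backward()
        meta_optimizer.step()
    ```

    The `create_graph=True` second-order term is what makes the procedure memory-hungry, which is consistent with the paper's observation that MetaHTR can become prohibitive for larger models or sentence-level HTR.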

    MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

    We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.
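    A simplified sketch of the kind of curation step such a pipeline applies after crawling, not the MaCoCu software itself: filtering crawled paragraphs by language identification and basic length heuristics. The fastText model file, target language, and thresholds are illustrative assumptions.

    ```python
    # Hypothetical curation step: keep only paragraphs identified as the target
    # language with sufficient confidence and a minimum length.
    # "lid.176.bin" is fastText's public language-ID model; the target language
    # ("mt" for Maltese, as an example under-resourced language) and thresholds
    # are assumptions for illustration.
    import fasttext

    lid_model = fasttext.load_model("lid.176.bin")

    def curate(paragraphs, target_lang="mt", min_chars=25, min_conf=0.5):
        kept = []
        for text in paragraphs:
            text = " ".join(text.split())          # normalize whitespace
            if len(text) < min_chars:
                continue
            labels, probs = lid_model.predict(text)
            lang = labels[0].replace("__label__", "")
            if lang == target_lang and probs[0] >= min_conf:
                kept.append(text)
        return kept
    ```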
